Bias-Complexity Trade-off
Error Decomposition
To put it in the simplest fashion:
$$L_D(h_S) = \epsilon_{\text{app}} + \epsilon_{\text{est}}, \qquad \text{where } \epsilon_{\text{app}} = \min_{h \in \mathcal{H}} L_D(h) \text{ and } \epsilon_{\text{est}} = L_D(h_S) - \epsilon_{\text{app}}.$$
Enlarging the hypothesis class decreases the approximation error but might increase the estimation error, as a rich $\mathcal{H}$ might lead to overfitting. On the other hand, choosing $\mathcal{H}$ to be a very small set reduces the estimation error but might increase the approximation error or, in other words, might lead to underfitting.
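As a rough, informal illustration of this trade-off (not part of the formal development), the Python sketch below fits polynomials of increasing degree to noisy samples of a sine curve; the degree plays the role of the richness of $\mathcal{H}$. The target function, noise level, sample sizes, degrees, and the use of squared loss instead of the 0-1 loss are all arbitrary choices made only for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target function and noisy samples (arbitrary choices for the illustration).
def target(x):
    return np.sin(2 * np.pi * x)

n_train, n_test = 30, 1000
x_train = rng.uniform(0, 1, n_train)
y_train = target(x_train) + 0.3 * rng.standard_normal(n_train)
x_test = rng.uniform(0, 1, n_test)
y_test = target(x_test) + 0.3 * rng.standard_normal(n_test)

# Each degree d defines a hypothesis class H_d (polynomials of degree <= d).
# As d grows, the approximation error shrinks, but the estimation error
# (the gap between train and test performance) tends to grow.
for d in [1, 3, 10]:
    coeffs = np.polyfit(x_train, y_train, d)   # least-squares ERM over H_d
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {d:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

Typically the low-degree fit shows similar but high train and test errors (underfitting), while the high-degree fit shows a small train error with a noticeably larger test error (overfitting).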
The No-Free-Lunch Theorem
Let $A$ be any learning algorithm for binary classification over $\mathcal{X}$, and let $m \le |\mathcal{X}|/2$ be the size of the training set. Then there exists a distribution $D$ over $\mathcal{X} \times \{0,1\}$ such that:
- There exists a function $f : \mathcal{X} \to \{0,1\}$ with $L_D(f) = 0$.
- With probability of at least $1/7$ over the choice of $S \sim D^m$, we have $L_D(A(S)) \ge 1/8$.
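Before the proof, here is a minimal numerical sketch of the construction used below, assuming one particular (arbitrarily chosen) learner that memorizes the training labels and predicts 0 on unseen points, and an arbitrary small value $m = 3$. It enumerates all $T = 2^{2m}$ labelling functions and all $k = (2m)^m$ training sequences exactly, and checks that $\max_i \mathbb{E}_{S \sim D_i^m}[L_{D_i}(A(S))] \ge 1/4$ for this learner.

```python
import itertools

m = 3                      # training-set size (arbitrary small choice)
C = list(range(2 * m))     # domain of size 2m

def learner(S):
    """An arbitrary example learner: memorize the training labels and
    predict 0 on every point not seen during training."""
    memory = dict(S)
    return lambda x: memory.get(x, 0)

worst_expected_loss = 0.0
# Enumerate all T = 2^(2m) labelling functions f_i over C.
for labels in itertools.product([0, 1], repeat=len(C)):
    f = dict(zip(C, labels))
    losses = []
    # Enumerate all k = (2m)^m training sequences S_j; under the uniform
    # distribution D_i they are all equally likely.
    for xs in itertools.product(C, repeat=m):
        h = learner([(x, f[x]) for x in xs])
        # L_{D_i}(A(S_j^i)): fraction of points in C that h mislabels.
        losses.append(sum(h(x) != f[x] for x in C) / len(C))
    expected_loss = sum(losses) / len(losses)  # E_{S ~ D_i^m}[L_{D_i}(A(S))]
    worst_expected_loss = max(worst_expected_loss, expected_loss)

print(f"max_i E[L_Di(A(S))] = {worst_expected_loss:.3f} (theorem guarantees >= 0.25)")
```

Because every labelled training set is enumerated, the reported value is exact for this learner rather than a Monte Carlo estimate.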
Proof
Let $C \subseteq \mathcal{X}$ be a subset of the domain of size $|C| = 2m$.
Next, consider all $T = 2^{2m}$ possible labelling functions $f_1, \dots, f_T$ mapping $C \to \{0,1\}$.
For each $f_i$, define a distribution $D_i$ over $C \times \{0,1\}$ that is uniform over $C$ with labels given by $f_i$, i.e., $D_i(\{(x, y)\}) = 1/|C|$ if $y = f_i(x)$ and $0$ otherwise. Denote by $D_i^m$ the distribution of $m$ i.i.d. samples from $D_i$.
Clearly, $L_{D_i}(f_i) = 0$.
Next, we want to show:
$$\max_{i \in [T]} \mathbb{E}_{S \sim D_i^m}\big[L_{D_i}(A(S))\big] \ge \frac{1}{4}.$$
Side track
Show that this inequality implies that there exists $i \in [T]$ such that, with probability at least $1/7$ over the training set $S \sim D_i^m$, we have $L_{D_i}(A(S)) \ge 1/8$.
Proof (using Markov's inequality)
Since $X = L_{D_i}(A(S))$ takes values in $[0, 1]$, we can use the following variant of Markov's inequality:
$$P(X \ge a) \ge \frac{\mathbb{E}[X] - a}{1 - a}.$$
Taking $i$ to be the index achieving the maximum, we have $\mathbb{E}[X] \ge 1/4$, so with $a = 1/8$ we get $P\big(L_{D_i}(A(S)) \ge 1/8\big) \ge \frac{1/4 - 1/8}{1 - 1/8} = \frac{1}{7}$.
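For completeness, this variant follows from the standard Markov inequality applied to the nonnegative random variable $1 - X$, using that $X$ takes values in $[0, 1]$:
$$P(X < a) = P(1 - X > 1 - a) \le \frac{\mathbb{E}[1 - X]}{1 - a} = \frac{1 - \mathbb{E}[X]}{1 - a}, \qquad \text{hence} \qquad P(X \ge a) \ge 1 - \frac{1 - \mathbb{E}[X]}{1 - a} = \frac{\mathbb{E}[X] - a}{1 - a}.$$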
Back on track. There are $k = (2m)^m$ possible sequences of $m$ examples from $C$. Denote these sequences by $S_1, \dots, S_k$. Also, if $S_j = (x_1, \dots, x_m)$, denote by $S_j^i$ the sequence labelled by the function $f_i$, that is, $S_j^i = \big((x_1, f_i(x_1)), \dots, (x_m, f_i(x_m))\big)$. If the distribution is $D_i$, then the possible training sets $A$ can receive are $S_1^i, \dots, S_k^i$, and they are all equally likely. Therefore,
$$\mathbb{E}_{S \sim D_i^m}\big[L_{D_i}(A(S))\big] = \frac{1}{k} \sum_{j=1}^{k} L_{D_i}\big(A(S_j^i)\big).$$
Using the fact that the maximum is at least the average, which in turn is at least the minimum, we have
$$\max_{i \in [T]} \frac{1}{k} \sum_{j=1}^{k} L_{D_i}\big(A(S_j^i)\big) \ge \frac{1}{T} \sum_{i=1}^{T} \frac{1}{k} \sum_{j=1}^{k} L_{D_i}\big(A(S_j^i)\big) = \frac{1}{k} \sum_{j=1}^{k} \frac{1}{T} \sum_{i=1}^{T} L_{D_i}\big(A(S_j^i)\big) \ge \min_{j \in [k]} \frac{1}{T} \sum_{i=1}^{T} L_{D_i}\big(A(S_j^i)\big).$$
Next, fix some $S_j$ of size $m$, and let $v_1, \dots, v_p \in C$ be the examples of $C$ that do not appear in $S_j$. Clearly, $p \ge m$. Therefore, for every function $h : C \to \{0,1\}$ and every $i \in [T]$, we have
$$L_{D_i}(h) = \frac{1}{2m} \sum_{x \in C} \mathbb{1}\big[h(x) \ne f_i(x)\big] \ge \frac{1}{2m} \sum_{r=1}^{p} \mathbb{1}\big[h(v_r) \ne f_i(v_r)\big] \ge \frac{1}{2p} \sum_{r=1}^{p} \mathbb{1}\big[h(v_r) \ne f_i(v_r)\big].$$
Hence,
$$\frac{1}{T} \sum_{i=1}^{T} L_{D_i}\big(A(S_j^i)\big) \ge \frac{1}{T} \sum_{i=1}^{T} \frac{1}{2p} \sum_{r=1}^{p} \mathbb{1}\big[A(S_j^i)(v_r) \ne f_i(v_r)\big] = \frac{1}{2p} \sum_{r=1}^{p} \frac{1}{T} \sum_{i=1}^{T} \mathbb{1}\big[A(S_j^i)(v_r) \ne f_i(v_r)\big] \ge \frac{1}{2} \min_{r \in [p]} \frac{1}{T} \sum_{i=1}^{T} \mathbb{1}\big[A(S_j^i)(v_r) \ne f_i(v_r)\big].$$
Next, fix some $r \in [p]$. Partition the functions $f_1, \dots, f_T$ into $T/2$ pairs, where each pair $(f_i, f_{i'})$ agrees on every element of $C$ except $v_r$. Since $v_r$ does not appear in $S_j$, every such pair produces the same labelled training set: $S_j^i = S_j^{i'}$.
Since $S_j^i = S_j^{i'}$ but $f_i(v_r) \ne f_{i'}(v_r)$, exactly one of the two predictions at $v_r$ is wrong, so
$$\mathbb{1}\big[A(S_j^i)(v_r) \ne f_i(v_r)\big] + \mathbb{1}\big[A(S_j^{i'})(v_r) \ne f_{i'}(v_r)\big] = 1, \qquad \text{and therefore} \qquad \frac{1}{T} \sum_{i=1}^{T} \mathbb{1}\big[A(S_j^i)(v_r) \ne f_i(v_r)\big] = \frac{1}{2}.$$
We can complete the proof with the following:
$$\frac{1}{T} \sum_{i=1}^{T} L_{D_i}\big(A(S_j^i)\big) \ge \frac{1}{2} \cdot \frac{1}{T} \cdot \frac{T}{2} = \frac{1}{4}.$$
Since this holds for every $j$, combining it with the chain of inequalities above gives $\max_{i \in [T]} \mathbb{E}_{S \sim D_i^m}\big[L_{D_i}(A(S))\big] \ge 1/4$, as required.
Corollary
Let $\mathcal{X}$ be an infinite domain set (e.g., $\mathbb{R}^d$) and let $\mathcal{H}$ be the set of all functions from $\mathcal{X}$ to $\{0,1\}$. Then $\mathcal{H}$ is not PAC-learnable.
Proof
Assume, by way of contradiction, that the class is PAC-learnable. Choose some $\epsilon < 1/8$ and $\delta < 1/7$. By the definition of PAC learnability, there must be some learning algorithm $A$ and an integer $m = m(\epsilon, \delta)$ such that, for any data-generating distribution $D$ over $\mathcal{X} \times \{0,1\}$, if there exists some function $f : \mathcal{X} \to \{0,1\}$ with $L_D(f) = 0$, then with probability greater than $1 - \delta$, when $A$ is applied to a sample $S$ of size $m$ generated i.i.d. by $D$, we have $L_D(A(S)) \le \epsilon$. However, by the No-Free-Lunch theorem, since $|\mathcal{X}| > 2m$, for every learning algorithm there exists a distribution $D$ such that with probability at least $1/7 > \delta$ we have $L_D(A(S)) \ge 1/8 > \epsilon$, which leads to the desired contradiction.